
Fix mutating admission webhook #60

Merged
merged 2 commits into master from fix-webhook on Nov 14, 2019

Conversation

@alembiewski (Contributor) commented Nov 13, 2019

What changes were proposed in this pull request?

This PR contains fixes for the mutating admission webhook, which is used to customize Spark driver and executor pods based on the specification in SparkApplication objects.

After the fix is accepted and verified, it should be applied to kudobuilder/operators#118 as well.

Why are the changes needed?

While working on DCOS-60865 (Test pod affinity and tolerations support), it was discovered that the aforementioned features didn't work as expected, even though the webhook was enabled by default and its service was installed. The operator pod's description showed the following warning:

➜ k describe pod test-instance-5869fb6d47-7tqw5 -n kudo-spark-operator-testing
Warning  FailedMount  66s (x3 over 68s)  kubelet, minikube  MountVolume.SetUp failed for volume "webhook-certs" : secret "spark-webhook-certs" not found

But the spark-webhook-certs secret was actually present:

➜ k get secrets spark-webhook-certs -n kudo-spark-operator-testing        
NAME                  TYPE     DATA   AGE
spark-webhook-certs   Opaque   4      4m19s

If we look at the AGE column for each object, we can see that the operator instance test-instance-5869fb6d47-7tqw5 was created before the spark-webhook-certs secret:

[screenshot: kubectl output comparing the AGE of the operator pod and the spark-webhook-certs secret]
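
For context, the failing mount comes from a secret volume in the operator Deployment, roughly like the fragment below. This is reconstructed from the event message above rather than copied from the actual manifest; until the cert-generation job creates the secret, the kubelet cannot mount the volume and keeps retrying.

# Reconstructed from the FailedMount event (illustrative, not the exact manifest).
# The operator pod references a secret that a separate init job creates,
# so the pod can be scheduled before the secret exists.
volumes:
  - name: webhook-certs
    secret:
      secretName: spark-webhook-certs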

Also, -webhook-svc-name was changed to match this declaration:
https://github.com/mesosphere/kudo-spark-operator/blob/2d7765b2569e551d53ed5c14ac93c232aabc52e0/kudo-operator/operator/templates/webhook-init-job.yaml#L23
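
In other words, the certificate is issued for the Service name passed to the cert-generation job, so the operator's -webhook-svc-name argument has to reference the same Service. A minimal sketch of that relationship (the surrounding fields and the placeholder value are assumptions for illustration, not the repo's actual manifest):

# Illustrative only: -webhook-svc-name must match the Service name
# that the webhook-init-job generated the certificate for.
containers:
  - name: sparkoperator
    args:
      - -enable-webhook=true
      - -webhook-svc-name=<service-name-used-by-webhook-init-job>
      - -webhook-svc-namespace=kudo-spark-operator-testing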

After those fixes were applied, ownerReferences, affinity, and tolerations appeared in the Pod's spec and worked as expected:

➜  kudo-spark-operator git:(test-affinity-and-tolerations) ✗ k get pod spark-pi-driver -o yaml -n kudo-spark-operator-testing 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2019-11-13T11:07:38Z"
  labels:
    spark-app-selector: spark-f3c0ec9c38ac469cb571a9f62cd98702
    spark-role: driver
    sparkoperator.k8s.io/app-name: spark-pi
    sparkoperator.k8s.io/launched-by-spark-operator: "true"
    sparkoperator.k8s.io/submission-id: c0143db9-ce40-4d45-a184-bc6e6ef7198f
    version: 2.4.3
  name: spark-pi-driver
  namespace: kudo-spark-operator-testing
  ownerReferences:
  - apiVersion: sparkoperator.k8s.io/v1beta2
    controller: true
    kind: SparkApplication
    name: spark-pi
    uid: e1770465-4be1-49d5-9585-1b7fcd568cb0
  resourceVersion: "252853"
  selfLink: /api/v1/namespaces/kudo-spark-operator-testing/pods/spark-pi-driver
  uid: 28a87622-bde3-4bfa-9004-7036905f34a2
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: non_existing_key
            operator: DoesNotExist
        topologyKey: test
...
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: non_existing_key
    operator: Exists
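
For reference, these fields originate in the SparkApplication object, and the webhook copies them onto the pods it mutates. Below is a trimmed sketch of the kind of spec that would produce the driver pod above; only the relevant driver fields are shown, and the image, main class, and other required fields are omitted.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: kudo-spark-operator-testing
spec:
  # ...type, image, mainClass, and other required fields omitted...
  driver:
    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: non_existing_key
                  operator: DoesNotExist
            topologyKey: test
    tolerations:
      - key: non_existing_key
        operator: Exists
        effect: NoSchedule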

How were the changes tested?

Tests from this repo
Tested locally in the scope of DCOS-60865: Test pod affinity and tolerations support

@rpalaznik (Contributor) left a comment


This PR fixes the mutating admission webhook for me. 👍

@farhan5900 (Contributor) left a comment


LGTM 👍


@alembiewski (Contributor, Author) commented Nov 13, 2019

As a side note, changing the order of resources in operator.yaml doesn't guarantee sequential deployment of the resources, so the secret "spark-webhook-certs" not found warning still appears in the Events section of the operator instance description, but the volume gets mounted after a few retry attempts. So the actual fix for the webhook issue boils down to changing only the -webhook-svc-name.

Nevertheless, in order to eliminate those warnings from Pod events, we can split the deploy phase into multiple steps; given that we use a serial strategy, we can achieve depends-on-like functionality in our deployment.
Here is an example of how it looks in KUDO (the number of steps and their names are points for discussion); a sketch of the corresponding operator.yaml declaration follows the output:

➜  kudo-spark-operator git:(test-affinity-and-tolerations) ✗ kubectl kudo plan status --instance=test-instance -n kudo-spark-operator-testing
Plan(s) for "test-instance" in namespace "kudo-spark-operator-testing":
.
└── test-instance (Operator-Version: "spark-beta1" Active-Plan: "deploy")
    └── Plan deploy (serial strategy) [IN_PROGRESS]
        └── Phase deploy-spark [IN_PROGRESS]
            ├── Step deploy-sa-and-rbac (COMPLETE)
            ├── Step deploy-webhook (COMPLETE)
            ├── Step deploy-spark (IN_PROGRESS)
            └── Step deploy-spark-history-server (PENDING)
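
In operator.yaml, a plan like this would be declared roughly as follows. This is a sketch against the map-based tasks/plans schema that contemporary KUDO versions used; the task names and resource files are placeholders, not the repo's actual layout.

# Sketch only: task names and resource files are placeholders.
tasks:
  sa-and-rbac:
    resources:
      - rbac.yaml
  webhook:
    resources:
      - webhook-init-job.yaml
  spark:
    resources:
      - spark-operator.yaml
  history-server:
    resources:
      - history-server.yaml
plans:
  deploy:
    strategy: serial
    phases:
      - name: deploy-spark
        strategy: serial              # serial steps give depends-on-like ordering
        steps:
          - name: deploy-sa-and-rbac
            tasks:
              - sa-and-rbac
          - name: deploy-webhook      # certs exist before the operator pod starts
            tasks:
              - webhook
          - name: deploy-spark
            tasks:
              - spark
          - name: deploy-spark-history-server
            tasks:
              - history-server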

No warnings in operator events:

➜  kudo-spark-operator git:(test-affinity-and-tolerations) ✗ k describe pod test-instance-88fb9c8ff-vnhgg
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  51s   default-scheduler  Successfully assigned kudo-spark-operator-testing/test-instance-88fb9c8ff-vnhgg to minikube
  Normal  Pulling    50s   kubelet, minikube  Pulling image "mesosphere/kudo-spark-operator:spark-2.4.3-hadoop-2.9-k8s"
  Normal  Pulled     48s   kubelet, minikube  Successfully pulled image "mesosphere/kudo-spark-operator:spark-2.4.3-hadoop-2.9-k8s"
  Normal  Created    48s   kubelet, minikube  Created container sparkoperator
  Normal  Started    48s   kubelet, minikube  Started container sparkoperator

I have a working POC of this approach in my local branch and ran a few tests on it; it looks like it's working fine (but it definitely needs to be thoroughly tested in CI).
What do you guys think about this idea? Does it make sense?

@alembiewski changed the title from "Fix Webhook initialization" to "Fix mutating admission webhook" on Nov 13, 2019
@akirillov (Contributor) left a comment


thanks, @alembiewski. The fix LGTM 👍

@akirillov (Contributor) commented:

@alembiewski, the idea of splitting the deploy phase into multiple steps sounds reasonable to me. We should have a discussion around the naming and find a suitable step breakdown.

I think we shouldn't be too granular, because some of the steps are optional (e.g. History Server installation), and it might be confusing for users to see a step for a component that is not enabled. I'd suggest having two bigger steps, like configure-security and deploy-spark, so that we separate the steps based on the type of components deployed: RBAC, service accounts, and certs go into the first group, and CRDs with deployments go into the second.
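
A hypothetical version of that breakdown, in the same sketch schema as above (the step and task names are only the ones suggested here, not actual code):

plans:
  deploy:
    strategy: serial
    phases:
      - name: deploy-spark
        strategy: serial
        steps:
          - name: configure-security   # RBAC, service accounts, webhook certs
            tasks:
              - security
          - name: deploy-spark         # CRDs, deployments, optional History Server
            tasks:
              - spark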

@alembiewski (Contributor, Author) commented:

@akirillov, sounds good, thanks!
All checks have passed, merging now.

@alembiewski alembiewski merged commit 265c957 into master Nov 14, 2019
@alembiewski alembiewski deleted the fix-webhook branch November 14, 2019 09:25